by

Richard  L.  Dykstra  and  Tim  Robertson 
ABSTRACT

Algorithms  for  solving  the  isotonic  regression  problem  in 
two  dimensions  are  difficult  to  implement  because  of  the 
large  number  of  lower  sets  present.  Here  a  new  algorithm  for 
solving  this  problem  based  on  a  simple 
proposed  and  shown  to  converge  to  the 
1. INTRODUCTION. Algorithms for calculating the least squares
isotonic regression function have received a great deal of attention
in the literature, and six such algorithms are discussed in Section 2.3
of Barlow, Bartholomew, Bremner and Brunk (1972). In situations where
there is one independent variable, all of the algorithms work very
efficiently. Perhaps the most widely used algorithm is the "pool
adjacent violators algorithm," which is applicable only in the case of
a simple linear ordering or an amalgamation of simple orderings. In
many isotonic regression problems we have more than one independent
variable present and are concerned with partial orderings. An important
example involves the prediction of success in college. Usually, this
prediction is based upon several independent variables, such as rank in
high school graduating class and score on a standardized examination
such as the ACT composite, and is measured in terms of a predicted
grade point average or predicted probability of obtaining a particular
GPA or better. The predicted value is usually obtained by regression
methods and is assumed to be nondecreasing in each independent
variable. The isotonic regression function has been found to compare
very favorably with other techniques with respect to predictive
accuracy (cf. Perrin and Whitney (1976) and Kolen and Whitney (1978)).

Some of the algorithms described in Barlow et al. are applicable to
the case of computing the doubly nondecreasing least squares regression
function, but the number of computations required can become
prohibitive. For example, consider the minimum lower sets algorithm
described in Section 2.3 of Barlow et al. Suppose one of our two
independent variables has a possible values and the other has b
possible values. By counting paths from the upper left hand corner to
the lower right hand corner of our a x b grid, it follows that the
number of lower sets is equal to $\binom{a+b}{a}$. If a = b this number
is approximately $(a\pi)^{-1/2}\, 4^{a}$ by Stirling's formula. Thus if
a = b = 20, and if consideration of each lower set were to require one
microsecond of computer time, then finding the first level set would
require 2312 minutes or 38.5 hours of CPU time. (One microsecond seems
conservative in light of the fact that computation of the average value
over that set would take at least two multiplications, two additions
and a division and the comparison would require a subtraction. The
present standard for making such predictions is four arithmetic
operations per microsecond.) Moreover, if the first level set is small
(as it would be with good data) the second cycle is nearly as
difficult as the first, etc.
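
The arithmetic behind this estimate can be checked with a short Python sketch. The function names are illustrative only and are not from the paper; the one-microsecond cost per lower set is the assumption stated in the text.

```python
import math


def lower_set_count(a: int, b: int) -> int:
    """Number of corner-to-corner lattice paths on an a x b grid: C(a+b, a)."""
    return math.comb(a + b, a)


def stirling_estimate(a: int) -> float:
    """Stirling approximation (a*pi)^(-1/2) * 4^a to C(2a, a)."""
    return 4.0 ** a / math.sqrt(a * math.pi)


a = 20
exact = lower_set_count(a, a)          # exact count for a 20 x 20 grid
approx = stirling_estimate(a)          # the approximation used in the text

# Assumed cost: one microsecond per lower set examined.
minutes = approx * 1e-6 / 60.0
hours = minutes / 60.0

print(f"exact count   : {exact}")
print(f"Stirling value: {approx:.4e}")
print(f"CPU time      : {minutes:.0f} minutes ({hours:.1f} hours)")
```

The Stirling value slightly overstates the exact count, which is why the time quoted in the text is a little larger than the exact count would give.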

Since the doubly nondecreasing regression function is so difficult
to compute, researchers have proposed using ad hoc estimators based
upon one dimensional smoothings. (The number of computations required
for one dimensional smoothings is essentially linear in the number of
entries.) Makowski (1974) studied consistency properties of estimators
obtained by successive one dimensional smoothings. Kolen, Smith and
Whitney (1977), Perrin and Whitney (1976), and Kolen and Whitney (1978)
proposed two different techniques for producing estimates which are
nondecreasing in each variable. One of their techniques was to first do
one dimensional row smoothings. After all rows had been adjusted,
reversals in the columns were adjusted by the same method. They then
returned to the original table and did one dimensional column
smoothings followed by row smoothings. Neither smoothing necessarily
produces a doubly nondecreasing table, so they averaged the two
results. (The average is not necessarily doubly nondecreasing but was
for their data.) This method was applied to the problem of estimating
the probability of obtaining a "B or better" GPA for entering college
students. The data are presented in Table 1. The two entries are the
total cell frequencies and the observed relative frequencies. We note
that there are a number of "reversals," even with a relatively large
sample size. The smoothed estimates, by their method, are presented in
Table 2 and the isotonic regression function with weights equal to
frequencies in Table 3. Note that not only the estimates but also the
level sets are different. These level sets are very useful for making
inferences about equivalent scores within the table.
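
To make the notion of a "reversal" concrete, the following Python sketch counts them in the Table 1 data. The definition used here is an assumption made for illustration: a reversal is a pair of adjacent cells, along either variable, whose observed relative frequency strictly decreases, with empty cells skipped. The paper does not give a formal definition, and the function name is hypothetical.

```python
# Relative frequencies and cell counts transcribed from Table 1.
# Rows run from the lowest ACT band (0-12) up to the highest (28+);
# columns run from the lowest high school GPA band to the highest.
P = [
    [0.0000, 0.0000, 0.0606, 0.0000, 0.0000],   # ACT 0-12
    [0.0000, 0.0470, 0.0313, 0.0492, 0.5000],   # ACT 13-17
    [0.0435, 0.0301, 0.0724, 0.1946, 0.1212],   # ACT 18-22
    [0.0000, 0.1250, 0.1818, 0.2833, 0.5238],   # ACT 23-27
    [0.0000, 0.2857, 0.2000, 0.5745, 0.8864],   # ACT 28+
]
N = [
    [10, 47, 33, 7, 0],
    [27, 149, 96, 61, 4],
    [23, 166, 152, 149, 33],
    [7, 56, 88, 180, 84],
    [0, 7, 10, 47, 44],
]


def count_reversals(p, n):
    """Return (reversals along GPA, reversals along ACT).

    Assumed definition: adjacent non-empty cells where the relative
    frequency strictly decreases as the variable increases.
    """
    rows, cols = len(p), len(p[0])
    along_gpa = along_act = 0
    for i in range(rows):
        for j in range(cols):
            if n[i][j] == 0:
                continue  # empty cell: no observed frequency to compare
            if j + 1 < cols and n[i][j + 1] > 0 and p[i][j + 1] < p[i][j]:
                along_gpa += 1
            if i + 1 < rows and n[i + 1][j] > 0 and p[i + 1][j] < p[i][j]:
                along_act += 1
    return along_gpa, along_act


gpa_reversals, act_reversals = count_reversals(P, N)
print(gpa_reversals, act_reversals, sum(map(sum, N)))
```

Under this definition the observed table violates the assumed ordering in both directions, which is what motivates smoothing it.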

Several of the algorithms discussed in Section 2.3 of Barlow et al.
are basically methods of finding linear orders which are consistent
with a partial order. We have not been able to write a program which
implements any of these methods for a doubly nondecreasing regression
function in a reasonable amount of time.


TABLE 1

The probability of making a "B or better" GPA
(top number = total cell frequency; bottom number = relative frequency)

ACT                        High School Grade Point Average
Composite        0 to 1.55   1.56 to 2.25   2.26 to 2.95   2.96 to 3.65   3.66 to 4.00

28+                      0              7             10             47             44
                     .0000          .2857          .2000          .5745          .8864

23-27                    7             56             88            180             84
                     .0000          .1250          .1818          .2833          .5238

18-22                   23            166            152            149             33
                     .0435          .0301          .0724          .1946          .1212

13-17                   27            149             96             61              4
                     .0000          .0470          .0313          .0492          .5000

0-12                    10             47             33              7              0
                     .0000          .0000          .0606          .0000          .0000


TABLE 2

The probability of making a "B or better" GPA
Kolen and Whitney Method

ACT                        High School Grade Point Average
Composite        0 to 1.55   1.56 to 2.25   2.26 to 2.95   2.96 to 3.65   3.66 to 4.00

28+                  .0314          .2353          .2353          .5745          .8864
23-27                .0314          .1250          .1818          .2833          .5238
18-22                .0314          .0375          .0724          .1867          .1934
13-17                   --          .0375          .0402          .0493          .1784
0-12                    --             --          .0383          .0421          .0425


TABLE 3

The probability of making a "B or better" GPA
Least squares isotonic regression (weights = cell frequencies)

In this paper we present an algorithm for calculating the least
squares isotonic regression function which is increasing in each of
two variables. This algorithm uses successive one dimensional
smoothings and is very efficient and easy to program. To illustrate,
we applied the algorithm to a 20 by 20 table of random numbers. The
isotonic regression was obtained, correct to four significant digits,
after 400 iterations, which required 39 seconds of CPU time at a cost
of one dollar and thirty-five cents. This algorithm is described in
Section 3. In Section 2 we summarize some well known properties of
isotonic regression which will be used in the proof that the algorithm
yields the desired result.


2. SOME PRELIMINARIES. We let
$\Omega = \{(i,j);\ i = 1,2,\ldots,a;\ j = 1,2,\ldots,b\}$ and define
the partial order $\ll$ on $\Omega$ by $(i,j) \ll (k,\ell)$ if and only
if $i - k \le 0$ and $j - \ell \le 0$. We denote an arbitrary real
function whose domain is $\Omega$ as a matrix, i.e.,

    $G = (g_{ij}) = (g((i,j))),\quad i = 1,2,\ldots,a;\ j = 1,2,\ldots,b.$

(Note that this is not the usual matrix notation where $g_{ij}$ refers
to the entry in the $i$th row and $j$th column.) We say that a function
$F : \Omega \to R$ is isotonic or order preserving if
$(i,j) \ll (k,\ell)$ implies $f_{ij} \le f_{k\ell}$. This is equivalent
to requiring that F be nondecreasing along both rows and columns. The
least squares isotonic regression problem is to

    Minimize $\sum_{i=1}^{a} \sum_{j=1}^{b} (g_{ij} - f_{ij})^2 w_{ij}$

for F belonging to the class, K, of isotonic functions, where
$w_{ij} > 0$ and G are given.

Since the class of isotonic functions forms a closed convex cone, it
is well known (for example, see Theorem 7.8 in Barlow et al.) that the
solution to the isotonic regression problem, say $G^*$, is
characterized by the conditions

(2.1)    $\sum_{i=1}^{a} \sum_{j=1}^{b} (g_{ij} - g^*_{ij})\, g^*_{ij}\, w_{ij} = 0,$

and

(2.2)    $\sum_{i=1}^{a} \sum_{j=1}^{b} (g_{ij} - g^*_{ij})\, h_{ij}\, w_{ij} \le 0,$

for all functions $H \in K$.
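
As a small illustration of conditions (2.1) and (2.2), the Python sketch below evaluates them on a two-point example with one reversal, where the solution is the pooled mean. The helper names are hypothetical. Condition (2.2) quantifies over every isotonic H, so the code can only spot-check it on a few hand-picked isotonic functions; that is not a proof.

```python
def inner(A, B, W):
    """Weighted inner product: sum over cells of a_ij * b_ij * w_ij."""
    return sum(a * b * w
               for ra, rb, rw in zip(A, B, W)
               for a, b, w in zip(ra, rb, rw))


def is_isotonic(F, tol=1e-12):
    """True if F is nondecreasing in each index (up to a small tolerance)."""
    a, b = len(F), len(F[0])
    return all(
        (i + 1 == a or F[i][j] <= F[i + 1][j] + tol)
        and (j + 1 == b or F[i][j] <= F[i][j + 1] + tol)
        for i in range(a) for j in range(b)
    )


def difference(G, F):
    """Cellwise G - F."""
    return [[g - f for g, f in zip(rg, rf)] for rg, rf in zip(G, F)]


# A 2 x 1 grid with unit weights and a single reversal (2 before 1).
G = [[2.0], [1.0]]
W = [[1.0], [1.0]]
G_star = [[1.5], [1.5]]            # pooled mean of the two cells

D = difference(G, G_star)
cond_21 = inner(D, G_star, W)      # left side of (2.1)

# Spot checks of (2.2) on a few isotonic H.
H_samples = [[[0.0], [1.0]], [[-1.0], [0.0]], [[1.0], [1.0]], [[-3.0], [5.0]]]
cond_22 = [inner(D, H, W) for H in H_samples]

print(is_isotonic(G), is_isotonic(G_star), cond_21, cond_22)
```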
3. THE ALGORITHM. The algorithm which we propose requires only the
ability to solve the isotonic regression problem with the usual
nondecreasing order (in one dimension) along rows and columns. Our
algorithm is given as follows:

1. Let $\hat G^{(1)}$ denote the isotonic regression solution of
   $G = (g_{ij})$ over rows, i.e., $\hat G^{(1)}$ minimizes
   $\sum_{i=1}^{a} (g_{ij} - f_{ij})^2 w_{ij}$ subject to
   $f_{1j} \le f_{2j} \le \cdots \le f_{aj}$, for $j = 1,\ldots,b$.
   We call $R^{(1)} = \hat G^{(1)} - G$ the first set of
   "row increments."

2. Let $\tilde G^{(1)}$ denote the isotonic regression solution over
   columns from the initial values $G + R^{(1)}$, i.e.,
   $\tilde G^{(1)}$ minimizes
   $\sum_{j=1}^{b} (g_{ij} + r^{(1)}_{ij} - f_{ij})^2 w_{ij}$ subject to
   $f_{i1} \le f_{i2} \le \cdots \le f_{ib}$, for $i = 1,\ldots,a$.
   We call $C^{(1)} = \tilde G^{(1)} - (G + R^{(1)})$ the first set of
   "column increments." Note that
   $\tilde G^{(1)} = G + R^{(1)} + C^{(1)}$.

3. Etc. At the beginning of the $n$th cycle, we obtain $\hat G^{(n)}$
   by isotonizing $G + C^{(n-1)}$ over rows. The $n$th set of row
   increments is defined by
   $R^{(n)} = \hat G^{(n)} - (G + C^{(n-1)})$, so that
   $\hat G^{(n)} = G + C^{(n-1)} + R^{(n)}$. We then obtain
   $\tilde G^{(n)}$ by isotonizing $G + R^{(n)}$ over columns. The
   $n$th set of column increments is given by
   $C^{(n)} = \tilde G^{(n)} - (G + R^{(n)})$, or equivalently
   $\tilde G^{(n)} = G + R^{(n)} + C^{(n)}$.
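
The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' program: the names `pava` and `isotonize_2d` are hypothetical, the one dimensional smoother is a plain weighted pool-adjacent-violators routine, all weights are assumed strictly positive, and a fixed number of cycles stands in for a convergence test. Following the paper's convention, a "row" smoothing varies i with j fixed and a "column" smoothing varies j with i fixed.

```python
def pava(y, w):
    """Weighted least squares nondecreasing fit (pool adjacent violators)."""
    blocks = []  # each block holds [weighted mean, total weight, cell count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while the last two blocks are out of order.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit


def isotonize_2d(G, W, cycles=100):
    """Alternate row and column smoothings, carrying the increments R and C."""
    a, b = len(G), len(G[0])
    R = [[0.0] * b for _ in range(a)]   # row increments
    C = [[0.0] * b for _ in range(a)]   # column increments, C^(0) = 0
    for _ in range(cycles):
        # Row step: for each fixed j, isotonize G + C over i.
        for j in range(b):
            x = [G[i][j] + C[i][j] for i in range(a)]
            fit = pava(x, [W[i][j] for i in range(a)])
            for i in range(a):
                R[i][j] = fit[i] - x[i]
        # Column step: for each fixed i, isotonize G + R over j.
        for i in range(a):
            x = [G[i][j] + R[i][j] for j in range(b)]
            fit = pava(x, W[i])
            for j in range(b):
                C[i][j] = fit[j] - x[j]
    # Result of the last column smoothing: G + R + C.
    return [[G[i][j] + R[i][j] + C[i][j] for j in range(b)] for i in range(a)]


# 2 x 2 example with unit weights; the low cell (1,1) holds the largest
# of the three values it must not exceed, so three cells pool to 4/3.
G = [[2.0, 1.0], [1.0, 2.0]]
W = [[1.0, 1.0], [1.0, 1.0]]
fit = isotonize_2d(G, W)
print(fit)
```

Note that each smoothing is applied to G plus the increments from the other direction, not to the previous smoothed table; simply alternating smoothings on the current table is the ad hoc procedure discussed in Section 1 and need not give the least squares solution.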


The utility of the algorithm lies in the following theorem.

Theorem 3.1. Both $\hat G^{(n)}$ and $\tilde G^{(n)}$ converge to the
true solution $G^*$ as $n \to \infty$.

Proof. If we denote the inner product norm as

    $\|F\| = (F,F)^{1/2} = \Big( \sum_{i=1}^{a} \sum_{j=1}^{b} f_{ij}^2 w_{ij} \Big)^{1/2},$

we first show

(3.1)    $\|\hat G^{(n)}\|^2 \ge \|\tilde G^{(n)}\|^2 \ge \|\hat G^{(n+1)}\|^2$  for all n.

To establish some additional notation, we denote the "row cone" by

    $K_r = \{F;\ f_{1j} \le f_{2j} \le \cdots \le f_{aj}$ for $j = 1,\ldots,b\}$

and the "column cone" by

    $K_c = \{F;\ f_{i1} \le f_{i2} \le \cdots \le f_{ib}$ for $i = 1,\ldots,a\}.$

The respective dual cones, as discussed in Barlow and Brunk (1972),
are

    $K_r^{*w} = \{H;\ \sum_{i=1}^{a} h_{ij} f_{ij} w_{ij} \le 0$ for $j = 1,\ldots,b$; for every $F \in K_r\}$

and

    $K_c^{*w} = \{H;\ \sum_{j=1}^{b} h_{ij} f_{ij} w_{ij} \le 0$ for $i = 1,\ldots,a$; for every $F \in K_c\}.$

Since $\hat G^{(n)}$ is the projection of $G + C^{(n-1)}$ onto $K_r$
(where we take $C^{(0)} = 0$), the work of Barlow and Brunk guarantees
that

    $-R^{(n)}$ minimizes $\|G + C^{(n-1)} - F\|^2$ for $F \in K_r^{*w}$.

Similarly,

    $-C^{(n)}$ minimizes $\|G + R^{(n)} - F\|^2$ for $F \in K_c^{*w}$.

Therefore, since $-C^{(n-1)} \in K_c^{*w}$ and $-R^{(n)} \in K_r^{*w}$,

    $\|G + C^{(n-1)} - (-R^{(n)})\|^2 \ge \|G + R^{(n)} - (-C^{(n)})\|^2 \ge \|G + C^{(n)} - (-R^{(n+1)})\|^2,$

which is equivalent to (3.1).

Next we show that $\{R^{(n)}\}$ and $\{C^{(n)}\}$ are bounded. If
not, let $(i_0,j_0)$ be a minimal point in $\Omega$ (with respect to
our partial ordering $\ll$) such that either $\{r^{(n)}_{i_0 j_0}\}$
or $\{c^{(n)}_{i_0 j_0}\}$ is unbounded. Say there exists a
subsequence $\{n_k\}$ such that $r^{(n_k)}_{i_0 j_0} \to -\infty$.
(Since $\sum_{i=1}^{i_0} r^{(n)}_{i j_0} w_{i j_0} \le 0$ for all n
(see Barlow and Brunk (1972)), $r^{(n_k)}_{i_0 j_0} \to +\infty$ would
contradict the fact that $(i_0,j_0)$ is minimal.) But this, together
with $\tilde G^{(n)} = G + R^{(n)} + C^{(n)}$ and the fact that
$\tilde G^{(n)}$ is bounded in norm (cf. (3.1)), implies that
$c^{(n_k)}_{i_0 j_0} \to \infty$. This, in turn, contradicts the fact
that $(i_0,j_0)$ is minimal since
$\sum_{j=1}^{j_0} c^{(n)}_{i_0 j} w_{i_0 j} \le 0$ for all n.

Projections onto convex sets are distance reducing so that

(3.2)    $\|C^{(i)} - C^{(i-1)}\|^2 = \|G + C^{(i)} - (G + C^{(i-1)})\|^2$
         $\ge \|R^{(i+1)} - R^{(i)}\|^2 = \|G + R^{(i+1)} - (G + R^{(i)})\|^2$
         $\ge \|C^{(i+1)} - C^{(i)}\|^2$  for all i.


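We now show that

(3.3)    $\|R^{(i+1)} - R^{(i)}\| \to 0$  (and hence $\|C^{(i+1)} - C^{(i)}\| \to 0$)  as $i \to \infty$.

If (3.3) were not the case, there would exist $(i_0,j_0) \in \Omega$
and $\epsilon > 0$ such that

(3.4)    $|r^{(i+1)}_{i_0 j_0} - r^{(i)}_{i_0 j_0}| > \epsilon$  for infinitely many i.

However, since $\{R^{(n)}\}$ is bounded, there exists a finite M such
that

(3.5)    $|r^{(i)}_{i_0 j_0} - r^{(j)}_{i_0 j_0}| < M$  for all i, j.

If we write

(3.6)    $\|R^{(i+1)} - R^{(i)}\|^2 - \|C^{(i+1)} - C^{(i)}\|^2$
         $= \|G + R^{(i+1)} + C^{(i+1)} - (G + R^{(i)} + C^{(i)})\|^2$
         $\quad + 2\big(G + R^{(i+1)} + C^{(i+1)} - (G + R^{(i)} + C^{(i)}),\ C^{(i)} - C^{(i+1)}\big),$

the left side of (3.6) converges to 0 since by (3.2) both terms
converge to the same quantity. The last term of the right side is
nonnegative by (2.1) and (2.2). Thus

(3.7)    $(R^{(i+1)} - R^{(i)}) + (C^{(i+1)} - C^{(i)}) \to 0$  as $i \to \infty$.

In similar fashion, beginning with

    $\|C^{(i+1)} - C^{(i)}\|^2 - \|R^{(i+2)} - R^{(i+1)}\|^2,$

we can conclude

(3.8)    $(R^{(i+2)} - R^{(i+1)}) + (C^{(i+1)} - C^{(i)}) \to 0$  as $i \to \infty$.

Subtracting (3.7) from (3.8) yields

    $(R^{(i+2)} - R^{(i+1)}) - (R^{(i+1)} - R^{(i)}) \to 0$  as $i \to \infty$.

Thus, for a sufficiently large $N_0$ and fixed $n_0$, we can keep

    $\big(r^{(N_0+i+1)}_{i_0 j_0} - r^{(N_0+i)}_{i_0 j_0}\big),\quad i = 1,2,\ldots,n_0,$

arbitrarily close to

    $\big(r^{(N_0+1)}_{i_0 j_0} - r^{(N_0)}_{i_0 j_0}\big).$

This, however, contradicts (3.4) and (3.5) both being true: choosing
$N_0$ so that (3.4) holds at $i = N_0$ and $n_0 > 2M/\epsilon$, the
$n_0$ successive increments share a sign and each exceeds
$\epsilon/2$ in absolute value, so their sum exceeds M.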

Since $\{R^{(n)}\}$ and $\{C^{(n)}\}$ are bounded, there must exist
convergent subsequences. Suppose $R^{(n_i)} \to R$ and
$C^{(n_i)} \to C$. Then, in light of (3.3),

    $\tilde G^{(n_i)} = G + R^{(n_i)} + C^{(n_i)}$

and

    $\hat G^{(n_i+1)} = G + R^{(n_i+1)} + C^{(n_i)}$

both converge to

    $G^* = G + R + C$  (in anticipation of this being the desired solution).

Since $\tilde G^{(n_i)}$ is an element of $K_c$ and
$\hat G^{(n_i+1)}$ is an element of $K_r$ for all i, we know that
$G^* \in K_r \cap K_c$ (these cones are closed). Furthermore,

    $(G - G^*, G^*) = (G + R - G^*, G^*) - (R, G^*)$
    $= \lim_{i \to \infty} (G + R^{(n_i)} - \tilde G^{(n_i)},\ \tilde G^{(n_i)}) + \lim_{i \to \infty} (G + C^{(n_i)} - \hat G^{(n_i+1)},\ \hat G^{(n_i+1)})$
    $= 0 + 0.$

Similarly, if $V \in K_r \cap K_c$,

    $(G - G^*, V) = (G + R - G^*, V) - (R, V)$
    $= \lim_{i \to \infty} (G + R^{(n_i)} - \tilde G^{(n_i)},\ V) + \lim_{i \to \infty} (G + C^{(n_i)} - \hat G^{(n_i+1)},\ V) \le 0 + 0.$

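Thus $G^*$ is the desired solution by (2.1) and (2.2). Moreover,
since

    $-C$ minimizes $\|G + R - F\|^2$, $F \in K_c^{*w}$,

and

    $-R$ minimizes $\|G + C - F\|^2$, $F \in K_r^{*w}$,

we may use the distance reducing property of projections to say

    $\|C^{(n)} - C\|^2 = \|G + C^{(n)} - (G + C)\|^2$
    $\ge \|R^{(n+1)} - R\|^2 = \|G + R^{(n+1)} - (G + R)\|^2$
    $\ge \|C^{(n+1)} - C\|^2$  for all n.

These distances are nonincreasing in n and tend to 0 along the
subsequence $\{n_i\}$. Thus

    $R^{(n)} \to R$ and $C^{(n)} \to C$ as $n \to \infty$,

which implies that

    $\hat G^{(n)} = G + R^{(n)} + C^{(n-1)}$ and $\tilde G^{(n)} = G + R^{(n)} + C^{(n)}$

both converge to

    $G^* = G + R + C$ as $n \to \infty$.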

4. OTHER POINTS. It is important to note that the solution
$G^* = G + R + C$ does not uniquely determine R and C. In fact, if we
begin with a column smoothing rather than a row smoothing we will
obtain different limiting values for R and C even though the same
limiting $G^*$ is obtained.

As one would expect, this procedure works equally well when the
order restrictions are modified to require nonincreasing rows, or
nonincreasing columns, or both. One has only to change the one
dimensional smoothing to operate in the appropriate direction. The
procedure can also be adapted to higher dimensions, although in this
case the number of required smoothings quickly becomes large.

We also wish to point out that $G^*$ itself solves many more
minimization (maximization) problems than the least squares problem
stated above. For example, from Theorem 1.10 of Barlow et al., if
$\Phi$ is an appropriate convex function and $\varphi$ is a
subgradient (basically a derivative) of $\Phi$, then $G^*$ solves the
problem

(4.1)    Maximize over $F \in K_r \cap K_c$:  $\sum_{i=1}^{a} \sum_{j=1}^{b} \{\Phi(f_{ij}) + (g_{ij} - f_{ij})\,\varphi(f_{ij})\}\, w_{ij}.$

Along somewhat similar lines, Theorem 3.1 of Barlow and Brunk
guarantees that the problem

(4.2)    Minimize over $F \in K_r \cap K_c$:  $\sum_{i=1}^{a} \sum_{j=1}^{b} (\Phi(f_{ij}) - g_{ij} f_{ij})\, w_{ij}$

is solved by $(\varphi^{-1}(g^*_{ij}))$, where once again $\Phi$ is an
appropriate convex function and $\varphi$ is a subgradient of $\Phi$.

Thus $G^*$ solves a much wider range of problems than is readily
apparent. For example, suppose one has a multinomial random vector
$(X_{ij})$, where the cell probabilities $p_{ij}$ are placed in a
rectangular grid and one wishes to find the maximum likelihood
estimators for the $p_{ij}$ subject to nondecreasing (nonincreasing)
rows and columns. This problem can be phrased in terms of (4.1), from
which it follows that the solution is given by $G^*$ where
$G = (X_{ij}/n)$ and $w_{ij} = 1$.

Similarly, if the $X_{ij}$ are independent binomial random
variables, one can show that the maximum likelihood estimators for the
$p_{ij}$ subject to nondecreasing (nonincreasing) rows and columns are
given by $G^*$ where $G = (X_{ij}/n_{ij})$ and $w_{ij} = n_{ij}$.

Finally, in order to illustrate the algorithm on a larger table, we
considered the data presented in Table 4. The entries are the first
year grade point averages of 2397 students who entered the University
of Iowa in the Fall of 1978. The independent variables are the
composite scores on the ACT Assessment and the student's high school
percentile rank. The expected first year grade point average is
assumed to be a nondecreasing function of both of these independent
variables. (The number in parentheses is the number of students in the
category.)

The least squares solution, correct to four significant digits, was
obtained after 500 iterations (250 row smoothings and 250 column
smoothings) at a cost of 9 seconds CPU time and 84 cents. These results
are given in Table 5 with the level sets indicated. Since the cost of
our algorithm is essentially linear in the number of points in the
grid, even very large arrays can be isotonized at a reasonable cost.